Planning Stage: Data Description & Exploratory Data Analysis - Individual Assignment 1

(1) Data Description

Our dataset is from https://www.kaggle.com/datasets/ruchikakumbhar/placement-prediction-dataset .

It describes whether a student will be placed into a job or not based on various factors including their academics and previous job training.

It's not stated how this data was collected.

  • Number of Observations - 10000
  • Number of Variables - 12

Descriptions of the Variables:

  • StudentID - Identifier (numeric) - A variable indicating the student's ID number.
  • CGPA - Quantitative - The overall CGPA grade achieved by the student.
  • Internships - Quantitative - The number of internships a student has done.
  • Projects - Quantitative - The number of projects a student has done.
  • Workshops/Certifications - Quantitative - The number of workshops and certifications a student has completed.
  • AptitudeTestScore - Quantitative - The score of a student on the aptitude test.
  • SoftSkillsRating - Quantitative - The student's soft skills rating.
  • ExtracurricularActivities - Categorical - Whether or not the student has extracurricular activities.
  • PlacementTraining - Categorical - Whether or not the student has received placement training.
  • SSC_Marks - Quantitative - The student's senior secondary school marks.
  • HSC_Marks - Quantitative - The student's higher secondary school marks.
  • PlacementStatus - Categorical - Whether or not the student was placed.

Pre-Selection of Variables:

StudentID will be dropped because an ID number carries no information about job placement. All of the other variables could be useful in the analysis, so they will be kept.

(2) Question

Question for Assignment 1: "How are CGPA, Aptitude Test Score, and Internship experience associated with the likelihood of job placement?"

Question changed for Assignment 2: "How are all of the variables in the dataset, except StudentID, associated with the likelihood of job placement?" I changed the question so that the model can use all of the variables and no potentially useful variable is removed.

My question is focused on prediction: the goal is to predict the response variable, PlacementStatus (whether or not a student gets a job placement), from the variables that I chose.

(3) Exploratory Data Analysis and Visualization

In [1]:
# Imported libraries needed for code

library(tidyverse)
library(broom)
library(ggplot2)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
In [2]:
# Read data, filtered for NA values, and then displayed head

placement_data <- read_csv("https://raw.githubusercontent.com/icyfrostbolt/placement-data/main/placementdata.csv")
placement_data <- na.omit(placement_data)

head(placement_data)
Rows: 10000 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): ExtracurricularActivities, PlacementTraining, PlacementStatus
dbl (9): StudentID, CGPA, Internships, Projects, Workshops/Certifications, A...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
A tibble: 6 × 12
StudentID  CGPA   Internships  Projects  Workshops/Certifications  AptitudeTestScore  SoftSkillsRating  ExtracurricularActivities  PlacementTraining  SSC_Marks  HSC_Marks  PlacementStatus
<dbl>      <dbl>  <dbl>        <dbl>     <dbl>                     <dbl>              <dbl>             <chr>                      <chr>              <dbl>      <dbl>      <chr>
1          7.5    1            1         1                         65                 4.4               No                         No                 61         79         NotPlaced
2          8.9    0            3         2                         90                 4.0               Yes                        Yes                78         82         Placed
3          7.3    1            2         2                         82                 4.8               Yes                        No                 79         80         NotPlaced
4          7.5    1            1         2                         85                 4.4               Yes                        Yes                81         80         Placed
5          8.3    1            2         2                         86                 4.5               Yes                        Yes                74         88         Placed
6          7.0    0            2         2                         71                 4.2               Yes                        No                 55         66         NotPlaced
In [3]:
# Created plot to visualize the data

options(repr.plot.width = 12, repr.plot.height = 7)

plot <- ggplot(placement_data, aes(x = AptitudeTestScore, y = CGPA, color = PlacementStatus, size = Internships)) +
    geom_point(alpha = 0.7) +
    labs(
        title = "CGPA vs Aptitude Test Score with Placement Status",
        x = "Aptitude Test Score",
        y = "CGPA",
        color = "Placement Status",
        size = "Number of Internships"
    ) +
    scale_size_continuous(
        breaks = c(0, 1, 2)
    )

plot
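[Figure: scatter plot "CGPA vs Aptitude Test Score with Placement Status"; x-axis: Aptitude Test Score, y-axis: CGPA, color: Placement Status, point size: Number of Internships]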

This plot is relevant to my question because it shows the three explanatory variables from my Assignment 1 question (CGPA, AptitudeTestScore, and Internships) together with the response variable, PlacementStatus. I chose CGPA and AptitudeTestScore because they are numeric and, based on their descriptions, seem likely to be related to placement status. I also included Internships because I was wondering whether previous internships would be associated with placement. Based on the graph, students with a higher CGPA and a higher AptitudeTestScore appear more likely to be placed: the lower-left corner contains mostly students who were not placed, while the upper-right corner contains mostly students who were placed. The point sizes also show that most students in the dataset already have at least one internship, and that even students with previous internships are sometimes not placed.
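
The visual impression above could also be checked with a small numeric summary. The sketch below is only an illustration: it retypes the six rows printed by head(placement_data) rather than loading the full dataset, and the names toy, upper_right, and placed_rate are made up for this example. It uses base R so that it runs without extra packages.

```r
# Toy input: the six rows printed by head(placement_data) above, retyped by hand.
# Only the columns needed here are kept; this is NOT the full 10,000-row dataset.
toy <- data.frame(
  CGPA              = c(7.5, 8.9, 7.3, 7.5, 8.3, 7.0),
  AptitudeTestScore = c(65, 90, 82, 85, 86, 71),
  Internships       = c(1, 0, 1, 1, 1, 0),
  PlacementStatus   = c("NotPlaced", "Placed", "NotPlaced",
                        "Placed", "Placed", "NotPlaced")
)

# Flag students at or above the median on both CGPA and aptitude score
# (roughly the "upper-right corner" of the scatter plot).
toy$upper_right <- toy$CGPA >= median(toy$CGPA) &
  toy$AptitudeTestScore >= median(toy$AptitudeTestScore)

# Share of placed students inside and outside that region.
placed_rate <- tapply(toy$PlacementStatus == "Placed", toy$upper_right, mean)
print(placed_rate)
```

Six rows are far too few to support any conclusion. The same three steps applied to placement_data would give the placement rate in each region of the plot.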

Methods and Plan & Computational Code and Output - Individual Assignment 2

(1) Methods and Plan

I'm planning on using logistic regression, which is appropriate because the response variable, PlacementStatus, is binary (placed or not placed) and my question asks how the other variables are associated with the likelihood of placement. The assumptions required for this method are that the log-odds of placement are a linear function of the explanatory variables, that the observations are independent, and that there is no strong multicollinearity among the explanatory variables. Potential weaknesses are that it is sensitive to outliers and influential observations, that it cannot capture non-linear relationships unless extra terms are added, and that it can overfit when many explanatory variables are included relative to the number of observations.
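
For reference, the model described above can be written out as follows. The symbols are my own notation and do not come from the dataset: p_i is the probability that student i is placed, x_{i1}, ..., x_{ik} are that student's explanatory variables, and \beta_0, ..., \beta_k are the coefficients to be estimated.

```latex
\log\left(\frac{p_i}{1 - p_i}\right)
  = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik},
\qquad
p_i = \Pr(\text{PlacementStatus}_i = \text{Placed})
```

The left-hand side is the log-odds of placement, which is why the linearity assumption is stated on the log-odds scale and not on the probability scale.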

(2) Computational Code and Output

In [4]:
# Changed the PlacementStatus variable into numbers to be used in logistic regression

placement_data <- placement_data %>%
    mutate(PlacementStatus = if_else(PlacementStatus == "Placed", 1, 0))

# Split data into train and test sets

placement_train <- placement_data %>% 
    slice_sample(prop = 0.7)

placement_test <- placement_data %>% 
    anti_join(placement_train)

# Created logistic model with training data and displayed tidied results in table

lm_placement <- glm(formula = PlacementStatus ~ . -StudentID, 
                    data = placement_train,
                    family = binomial)

tidy(lm_placement)
Joining with `by = join_by(StudentID, CGPA, Internships, Projects,
`Workshops/Certifications`, AptitudeTestScore, SoftSkillsRating,
ExtracurricularActivities, PlacementTraining, SSC_Marks, HSC_Marks,
PlacementStatus)`
A tibble: 11 × 5
term                          estimate      std.error    statistic   p.value
<chr>                         <dbl>         <dbl>        <dbl>       <dbl>
(Intercept)                   -17.40530805  0.626182236  -27.795915  4.860092e-170
CGPA                          0.36148377    0.059725395  6.052430    1.426770e-09
Internships                   0.06752128    0.050481540  1.337544    1.810452e-01
Projects                      0.28210821    0.045113401  6.253313    4.018364e-10
`Workshops/Certifications`    0.11921757    0.037735445  3.159299    1.581489e-03
AptitudeTestScore             0.06824494    0.005578539  12.233480   2.059238e-34
SoftSkillsRating              0.62094657    0.100229274  6.195262    5.818829e-10
ExtracurricularActivitiesYes  0.70875803    0.080272152  8.829439    1.052024e-18
PlacementTrainingYes          0.92106647    0.086013069  10.708448   9.290636e-27
SSC_Marks                     0.02679967    0.003769967  7.108729    1.171167e-12
HSC_Marks                     0.02941818    0.004533955  6.488415    8.674405e-11
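
The estimates in this table are on the log-odds scale, which is hard to read directly. A minimal sketch of converting a few of them to odds ratios is shown below. The three numbers are copied from the table; the vector names log_odds and odds_ratios are made up for this example.

```r
# Estimates copied from the tidy(lm_placement) output above (log-odds scale).
log_odds <- c(
  CGPA                 = 0.36148377,
  Internships          = 0.06752128,
  PlacementTrainingYes = 0.92106647
)

# Exponentiating a logistic regression coefficient gives an odds ratio:
# the multiplicative change in the odds of placement for a one-unit increase
# in that predictor, holding the other predictors fixed.
odds_ratios <- exp(log_odds)
print(round(odds_ratios, 2))
```

For example, the PlacementTrainingYes estimate corresponds to an odds ratio of roughly 2.5, so in this fitted model the odds of placement for students with placement training are about 2.5 times those of otherwise similar students without it. With broom, tidy(lm_placement, exponentiate = TRUE) should report the same quantities for every term.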

The logistic regression results show that CGPA, Projects, Workshops/Certifications, AptitudeTestScore, SoftSkillsRating, ExtracurricularActivities, PlacementTraining, SSC_Marks, and HSC_Marks all have positive coefficients, meaning that each is associated with higher odds of job placement when the other variables are held fixed. The coefficient for Internships is also positive (about 0.068), but its p-value (about 0.181) is above 0.05, so the number of internships is not a statistically significant predictor of placement in this model. All of the other variables have p-values below 0.05, indicating that their estimated coefficients differ significantly from zero. These findings align with the expectation that strong academic performance, skills, and training improve the likelihood of job placement.
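
The cell above creates placement_test but does not use it yet, and the split is made without a random seed, so the coefficients will change slightly on each run. The sketch below shows one way a reproducible split and a held-out evaluation could look. It runs on simulated stand-in data, not the placement dataset, and the seed, the 0.5 cutoff, and all names (sim, train, test, fit) are assumptions made for illustration.

```r
set.seed(2024)  # arbitrary seed, only so the split is reproducible

# Simulated stand-in data (NOT the placement dataset): two predictors and a
# binary outcome generated from a known logistic model.
n <- 1000
sim <- data.frame(
  id   = seq_len(n),
  cgpa = rnorm(n, mean = 7.5, sd = 0.6),
  apt  = rnorm(n, mean = 80, sd = 8)
)
sim$placed <- rbinom(n, size = 1,
                     prob = plogis(-17 + 1.2 * sim$cgpa + 0.1 * sim$apt))

# 70/30 split by row id, so no row can appear in both sets.
train_ids <- sample(sim$id, size = round(0.7 * n))
train <- sim[sim$id %in% train_ids, ]
test  <- sim[!sim$id %in% train_ids, ]

# Fit on the training rows only.
fit <- glm(placed ~ cgpa + apt, data = train, family = binomial)

# Predicted probabilities for the held-out rows, classified at a 0.5 cutoff
# (the cutoff is an assumption; other thresholds may suit the question better).
test_prob <- predict(fit, newdata = test, type = "response")
test_pred <- as.integer(test_prob >= 0.5)

# Confusion matrix and overall accuracy on the test set.
conf_mat <- table(actual = test$placed, predicted = test_pred)
accuracy <- mean(test_pred == test$placed)
print(conf_mat)
print(accuracy)
```

Applied to this analysis, the same idea would mean calling set.seed() before slice_sample() and passing placement_test as newdata to predict() on lm_placement.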
